Analysis of Similarity/Dissimilarity of DNA Sequences Based on Chaos Game Representation
Authors
Abstract
Abstract and Applied Analysis, p. 3

Table 1: The coding sequences of the first exon of the β-globin gene of different species.

Species   Coding sequence
Human     ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
Goat      ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum   ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus    ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGGAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur     ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse     ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rabbit    ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Rat       ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla   ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG

Table 2: Hurst exponent of the CGR-walk sequence {X_n} of the nine species in Table 1.

            Human   Goat    Opossum  Gallus  Lemur   Mouse   Rabbit  Rat     Gorilla
H(X_n^RY)   0.445   0.5024  0.6536   0.5075  0.5016  0.538   0.429   0.5791  0.4698
H(X_n^MK)   0.7452  0.7853  0.6547   0.7212  0.7487  0.7094  0.8099  0.5237  0.7467
H(X_n^WS)   0.641   0.6894  0.6292   0.5756  0.6753  0.8118  0.615   0.7255  0.6302

3. Numerical Characterization of DNA Sequences

The comparison of DNA sequences has attracted researchers from computer science and mathematics, and related work has made progress [13, 16–28]. Using the CGR-walk technique, a DNA sequence can be represented by a random numerical sequence. Gao and Xu [29] also substantially corroborated the finding that the data exhibit marked long-range correlations. In this paper, we explore the tendency of a data series by calculating the Hurst exponent [30].
Some work has also been done on the relation between long-range correlation and the Hurst exponent [31]. To characterize numerically a DNA sequence given by the CGR, we treat the Hurst exponent as an efficient invariant that is sensitive to this kind of graphical representation. Because a DNA sequence can be regarded as an ordered set over the alphabet N = (A, C, G, T), we represent a DNA sequence as a finite set with N elements, denoted as [i] := {1, 2, ..., N}. For any time series {u_i}, i = 1, ..., N, one can define several quantities as follows [30]: (i) the partial mean
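The list of quantities from [30] is cut off in this excerpt. A common way to estimate the Hurst exponent is rescaled-range (R/S) analysis, which starts from the partial mean of the series. The sketch below assumes that standard procedure and is not taken from the paper; the function names `rescaled_range` and `hurst_rs`, the window-doubling scheme, and the `min_window` parameter are illustrative choices.

```python
import math
import random


def rescaled_range(u):
    """Return the R/S statistic of one window, or None if the window is constant."""
    n = len(u)
    mean = sum(u) / n                      # partial mean of the window
    cum, deviations = 0.0, []
    for v in u:                            # cumulative deviations from the mean
        cum += v - mean
        deviations.append(cum)
    r = max(deviations) - min(deviations)  # range of the cumulative deviations
    s = math.sqrt(sum((v - mean) ** 2 for v in u) / n)  # standard deviation
    return r / s if s > 0 else None


def hurst_rs(series, min_window=8):
    """Estimate H as the least-squares slope of log(R/S) against log(n).

    Assumption: window lengths double from min_window up to half the
    series length, and R/S is averaged over non-overlapping windows.
    """
    total = len(series)
    log_n, log_rs = [], []
    n = min_window
    while n <= total // 2:
        values = []
        for start in range(0, total - n + 1, n):
            rs = rescaled_range(series[start:start + n])
            if rs is not None and rs > 0:
                values.append(rs)
        if values:
            log_n.append(math.log(n))
            log_rs.append(math.log(sum(values) / len(values)))
        n *= 2
    if len(log_n) < 2:
        raise ValueError("series too short for R/S analysis")
    mean_x = sum(log_n) / len(log_n)
    mean_y = sum(log_rs) / len(log_rs)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_n, log_rs))
    den = sum((x - mean_x) ** 2 for x in log_n)
    return num / den


if __name__ == "__main__":
    # Illustrative input only: uncorrelated Gaussian noise, not a CGR-walk sequence.
    random.seed(0)
    noise = [random.gauss(0, 1) for _ in range(4096)]
    print("Estimated H for white noise:", hurst_rs(noise))
```

Values of H near 0.5 indicate an uncorrelated series, values above 0.5 indicate persistence, and values below 0.5 indicate anti-persistence. The sequences in Table 1 are only about 90 nucleotides long, so an estimate from so few window sizes would be coarse, and this sketch is not expected to reproduce the values in Table 2, which also depend on the CGR-walk construction that this excerpt does not define.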
Related articles
A probabilistic measure for alignment-free sequence comparison
MOTIVATION Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULT...
Analysis of similarity/dissimilarity of DNA sequences based on adjacent nucleotide pair representation
Introduction of graphic representation for nucleotide or protein sequences can provide intuitive overall pictures as well as useful insights for performing large-scale similarity analysis. In this paper, we are analyzing the similarity/dissimilarity of the mitochondrial genome sequences from twenty four mammal species. The analysis is important in finding the relatedness among the species and e...
Self-Similarity Limits of Genomic Signatures
It is shown that metric representation of DNA sequences is one-to-one. By using the metric representation method, suppression of nucleotide strings in the DNA sequences is determined. For a DNA sequence, an optimal string length to display genomic signature in chaos game representation is obtained by eliminating effects of the finite sequence. The optimal string length is further shown as a sel...
A novel method to reconstruct phylogeny tree based on the chaos game representation
We developed a new approach for the reconstruction of phylogeny trees based on the chaos game representation (CGR) of biological sequences. The chaos game representation (CGR) method generates a picture from a biological sequence, which displays both local and global patterns. The quantitative index of the biological sequence is extracted from the picture. The Kullback-Leibler discrimination in...
Encoding DNA sequences by integer chaos game representation
Motivation: DNA sequences are fundamental for encoding genetic information. The genetic information may be understood not only by symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA seque...